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We present here algorithms for efficient computation of hnear algebra prob- 
lems over finite fields. Implementations^ of the proposed algorithms are available 
through the Magma, Maple (within the LinearAlgebra [Modular] subpackage) 
and Sage systems; some parts can also be found within the C/C-I--I- libraries NTL, 
FLINT, IML, M4RI and the special purpose LinBox template library for exact, 
high-performance linear algebra computation with dense, sparse, and structured 
matrices over the integers and over finite fields [16]. 

1 Dense matrix multiplication 

Definition 1 For A G and B e F^'^" with elements Aij and Bij, the matrix 

C — A X B has Cij — X]f=i denote by MM{m, k,n) a time complexity 

bound on the number of field operations necessary to compute C. 

Classical triple loop implementation of matrix multiplication makes MM(m, fc, n) < 
2mkn. The best published estimates to date gives Wl{n,n,n) < ©(n") with uj w 
2.3755 [13], though improvements to 2.3737 and 2.3727 are now claimed [50, 53]. 
For very rectangular matrices one also have astonishing results like MM(n, n,n") < 
0(0 for a constant a > 0.294 and any e > [12]. Nowadays practical im- 

plementations mostly use Strassen-Winograd's algorithm, see section 1.4, with an 
intermediate complexity and uj « 2.8074. 

1.1 Tiny finite fields 

The practical efficiency of matrix multiplication depends highly on the representation 
of field elements. We thus present three kinds of compact representations for ele- 
ments of a finite field with very small cardinality: bitpacking (for F2), bit-slicing (for 
say F3, F5, F7, F23, or F32) and Kronecker substitution. These representations are 
designed to allow efficient linear algebra operations, including matrix multiplication. 

Definition 2 [5] Bitslicing consists in representing an n-dimensional vector ofk-bit 
sized coefficients using k binary vectors of dimension n. In particular, one can use 
boolean word instruction to perform arithmetic on 64 dimensional vectors. 

• Over F3, the binary representation = [0, 0], 1 = [1, 0], — 1 = [11] allows to add 
and subtract two elements in 6 boolean operations: 

Add{[x(3,xi\, [2/0, J/i]) : s ^ ® yi, t ^ xi ® j/o 

Return{s At,{s<S) Xi) V (t © yi)) 

Sub{[xo,Xi],[yo,yi]) : t ^ xq ® yo 

Return{t V (xi ® yi), [t ® yi) A (2/0 © Xi)) 

'^http: //magma. maths .usyd. edu. au, http://www.maplesoft.com, http://sagemath.org, http: 
//www . shoup.net/ntl, http : / /www. f lintlib . org, http : //www . cs .uwaterloo . ca/~astor joh/iml . 
html, http://m4ri.sagemath.org, http://linalg.org 
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Algorithm 1.1 [Greasing] 

Over F2, the method of the four Russians [2], also called Greasing can be used as 
follows: 

• A 64 bit machine word can be used to represent a row vector of dimension 64. 

• Matrix multiplication of a m x fc matrix Ahy a, k xn matrix B can be done by 
first storing all 2*^ fc-dimensional linear combinations of rows of i3 in a table. 
Then the i-th row of the product is copied from the row of the table indexed 
by the i-th row of A. 

• By ordering indices of the table according to a binary Gray Code, each row of 
the table can be deduced from the previous one, using only one row addition. 
This brings the bit operation count to build the table from fc2'^n to 2'^n. 

• Choosing k ~ log2 n in the above method implies MM(n) = 0{n'^/\ogn) 
over F2. 



• Over ¥5 (resp. F7 j, a redundant representation x = + 2a;i +4x2 = [2^072:1,2] 
allows to add two elements in 20 (resp. 17) boolean operations, negate in 3 
(resp. 6) boolean operations and double in (resp. 5) boolean operations. 





F3 


F5 


Fj 


Addition 


6 


20 


17 


Negation 


1 


5 


3 


Double 




5 






Table 1: boolean operation counts for basic arithmetic using bit slicing 



Definition 3 Bitpacking consists in representing a vector of field elements as an 
integer fitting in a single machine word using a 2^-adic representation: 

(xo, . . . , a;„_i) e F;' = a = xo + 2''xi + ■■■ + (2'=)"-ix„_i G Z264 

Elements of extension fields are viewed as polynomials and stored as the evaluation 
of this polynomial at the characteristic of the field. The latter evaluation is called 
Kronecker substitution. 

We first need a way to simultaneously reduce coefficients modulo the characteristic, 
see [14]. 

Once we can pack and simultaneously reduce coefficients of finite field in a sin- 
gle machine word, the obtained parallelism can be used for matrix multiplication. 
Depending on the respective sizes of the matrix in the multiplication one can pack 
only the left operand or only the right one or both [15]. We give here only a generic 
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Algorithm 1.2 [REDQ: Q-adic REDuction] 



Input: Three integers p, q and f = J2i=ol^i9^ ^ ^• 
Output: p G Z, with p = X^iLo Mi?' where /i^ — /a; mod p. 
REDQ COMPRESSION 



for i = to d do 



end for 



P 



REDQ CORRECTION 

for i = to d — 1 do 

fit ^ Ui - qui+i modp; 
end for 



{only when p \ q, otherwise /i^ — Ui is correct} 



9: Return p = Y,i=o Mj? 



Algorithm 1.3 [Right packed matrix multiplication] 

Input: A prime p and Ac E F™^*^ and Be G F^^", stored with several field elements 
per machine word. 

Output: Cc ^ Ac X Be e F™^" 
1: A — Uncompress(^c); {extract the coefficients} 

2: Cc — Ax Be, {Using e.g., algorithm 1.5} 

3: Return REDQ(Cc); 
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algorithm for packed matrices, which use multiphcation of a right packed matrix by 
a non packed left matrix. 

Then, over extensions, fast floating point operations can be used on the Kronecker 
substitution of the elements. Indeed, it is very often desirable to use floating point 
arithmetic, exactly. For instance floating point routines can more easily use large 
hardware registers, they can more easily optimize the memory hierarchy usage [30, 
57] and portable implementations are more widely available. We present next the 
dot product and the matrix multiplication is then straightforward [17, 14, 15]. 



Algorithm 1.4 [Compressed Dot product over extension fields] 
Input: A field ¥pk with elements represented as exponents of a generator of the 
field; 

Input: two vectors vi and V2 of elements of Fpt; 
Input: a sufficiently large integer q. 
Output: R e Fpfc, with R — vf ■ V2- 

{Tabulated conversion: uses tables from exponent to floating point evaluation} 
1: Set vi and V2 to the floating point Kronecker substitution of the elements of vi 
and V2- 

2: Compute f = iJi'^ ■ V2', {The floating point computation} 

3: r = REDQ_COMPRESSION{f,p, q); {Computing a radix decomposition} 

{Variant of REDQ.CORRECTION: fi, = Pi mod p for f = E?=o ^ 
4: Set L = representation{^'^^Q fiiX^); 
5: Set H = representation{X^-^ x E?=fc-i l^iX"-^"'^)] 

6; Return R = H + L E ¥pk ; {Reduction in the field} 



1.2 Word size prime fields 

Over word-size prime fields one can also use the reduction to floating point routines 
of algorithm 1.4. The main point is to be able to perform efficiently the matrix 
multiplication of blocks of the initial matrices without modular reduction. Thus 
delaying the reduction as much as possible, depending on the algorithm and internal 
representations, in order to amortize its cost. We present next such a delaying with 
the classical matrix multiplication algorithm and a centered representation [18]. 

1.3 Large finite fields 

If the field is too large for the strategy 1.5 over machine words, then two main 
approaches would have to be considered: 

• Use extended arithmetic, either arbitrary of fixed precision, if the characteristic 
is large, and a polynomial representation for extension flelds. The difficulty 
here is to preserve an optimized memory management and to have an almost 
linear time extended precision polynomial arithmetic. 
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Algorithm 1.5 [f gemm: Finite Field GEneric Matrix Multiplication] 

Input: An odd prime p of size smaller than the floating point mantissa /3 and 

elements stored by values between and 
Input: A e F™^'^ and B G F^^" 
Output: C = Ax B e F^""" 
1: if n{p- f)2 < 2^+1 then 

2: Convert A and B to floating point matrices Af and Bf] 
3: Use floating point routines to compute C/ = Af x Bf; 
4: C = Cf mod p; 
5: else 

6: Cut A and B into smaller blocks; 

7: Call the algorithm recursively for the block multiplications; 
8: Perform the block additions modulo p; 
9; end if 



• Use a residue number system and an evaluation/interpolation scheme: one can 
use algorithm f .5 for each prime in the RNS and each evaluation point. For 
¥pk , the number of needed primes is roughly 2 logj^ (p) and the number of 
evaluations points is 2fc — 1. 



1.4 Large matrices: subcubic time complexity 

With matrices of large dimension, sub-cubic time complexity algorithms, such as 
Strassen-Winograd's [59] can be used to decrease the number of operations. Algo- 
rithm 1.4 describes how to compute one recursive level of the algorithm, using seven 
recursive calls and 15 block additions. 



Algorithm 1.6 [Strassen-Winograd] 



A^ 



'All Ai2 
A21 A22 



Bii B12 
B21 B22 



Cii C12 
C21 C22 



51 ^ A2l+A22\ 

52 <— 5*1 — All] 

53 ^ A11-A21; 

54 4- A12 ~ S2; 

Cii^Pi + P2; 

Ci2^Ui+P3- 



Ti 4- B12 — Bii\ 
T2 4— B22 — Ti ; 

73 4— B22 — B12 ; 

74 <— T2 — ^21; 

U2^Pi+ Pe; 
C21 ^ C/3 - Pi-, 



Pi ^ All X Bii; 
Ps^ SiX B22; 
P^^SiX Ti- 
Pr^Sax T3; 

U3^U2+ PT: 

C22^Ua + P5: 



P2 ^ A12 X B21; 
P4 ^ A22 X Ti] 

P6^S2X T2; 



C/4^C/2+P5; 



In practice, one uses a threshold in the matrix dimension to switch to a base case 
algorithm, that can be any of the one previously described. Following section 1.2, 
one can again delay the modular reductions, but the intermediate computations of 
Strassen-Winograd's algorithm impose a tighter bound: 
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Theorem 4 [18] Let A £ Z™^'^, B e Z^^^" C G Z™^" and P e Z with 

aij,bij,Cij, P € {0...p~ 1}. r/ien every intermediate value z involved in the 
computation of A x B + j3C with I (I > I) recursive levels of algorithm 1.4 satisfy: 



Moreover, this bound is tight. 

For instance, on a single Xeon 2.8GHz core with gcc-4.6.3, Strassen-Winograd's 
variant implemented with LinBox-1.2.1 and GotoBLAS2-1.13 can be 37% faster for 
the multiplication of 10 000 x 10 000 matrices over F2i9_i, in less than 1'49". 

2 Dense Gaussian elimination and echelon forms 

In this section, we present algorithms computing the determinant and inverse of 
square matrices; the rank, rank profile, nuUspace, and system solving for arbitrary 
shape and rank matrices. All these problems are solved a la Gaussian elimination, 
but recursively in order to effectively incorporate matrix multiplication. The latter is 
denoted generically gemm and, depending on the underlying field, can be implemented 
using any of the techniques of sections 1.1, 1.2 or 1.3. 

A special care is given to the asymptotic time complexities: the exponent is 
reduced to that of matrix multiplication using block recursive algorithms, and the 
constants are also carefully compared. Meanwhile, this approach is also effective for 
implementations: grouping arithmetic operations into matrix-matrix products allow 
to better optimize cache accesses. 

2.1 Building blocks 

Algorithms 2.1, 2.2, 2.3 and 2.4 show how to reduce the computation of triangular 
matrix systems, triangular matrix multiplications, and triangular matrix inversions 
to matrix-matrix multiplication. Note that they do not require any temporary stor- 
age other than the input and output arguments. 

2.2 PLE decomposition 

Dense Gaussian elimination over finite fields can be reduced to matrix multiplication, 
using the usual techniques for the LU decomposition of numerical linear algebra [7] . 
However, in applications over a finite field, the input matrix often has non-generic 
rank profile and special care needs to be taken about linear dependencies and rank 
deficiencies. The PLE decomposition is thus a generalization of the PLU decompo- 
sition for matrices with any rank profile. 

Definition 5 A matrix is in row-echelon form if all its zero rows occupy the last 
row positions and the leading coefficient of any non-zero row except the first one is 
strictly to the right of the leading coefficient of the previous row. Moreover, it is said 
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Algorithm 2.1 [trsm: Triangular System Solve with Matrix right hand side] 
Input: A e F™^™ non-singular upper triangular, 



Output: X e F^^" s.t. AX 



B 



if m=l then return X ^ A^^ -^^ x B end if 
X2 =trsm(A3,B2); 

Bi = B1-A2X2; {using gemm, e.g., via alg. 1.5} 



Xi =trsm(Ai, 



Using the conformal block 
decomposition: 



'Ai A2 










^3. 











return X 



Xo 



Algorithm 2.2 [trmm: Triangular Matrix Multiplication] 

Input: A e F™^™ upper triangular, i? e F^ 
Output: C e F"^" s.t. AB = C 



^mxn 
■ 9 



if m=l then return C — 1 x B end if 

Ci ==trmm(^i,_Bi); 

Ci = C*i + /I2B2; {using gemm} 

C2 =trmin(^3,B2); 



5: return C = 



Co 



Using the conformal block 
decomposition: 



'Ai A2 










^3. 











Algorithm 2.3 [trtri: Triangular Matrix Inversion] 



Input: A e Fq^" upper triangular and non-singular 
Output: C = A 1 



if m=l then return C = A^ \ end if 

Ci = ^r^; {using trtri recursively} 
C3 = ^3^^; {using trtri recursively} 
C2 = A2C^; {using trmm } 
C2 = — C1C2; {using trmm } 
Ci C2 
C3 ' 



1 

2 
3 
4 
5 

6: return C = 



Using the conformal 
decomposition: 



'Ai A2 




Ci C2 


^3. 




^3. 



block 
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Algorithm 2.4 [trtrm: Upper-Lower Triangular Matrix Multiplication] 



Input 
Input 
Output: A 



lower triangular 
upper triangular 



UL 



if m=l then return A — C/i i^i i end if 

Ai = UiLi] {using trtrm recursively} 
Ai = Ai + U2L2; {using gemm} 
A2 = U2L3; {using trmm } 
A3 ~ U3L2] {using trmm } 
A4 = U3L3] {using trtrm recursively} 
'Ai ^2] 

A3 ^4 ' 



1 
2 
3 
4 
5 
6 

7: return A — 



Using the conformal block 
decomposition: 



"il 




Vi U2 




'Ai A2 


L2 -^3 




Us, 


1 


A3 Ai_ 



to be in reduced row- echelon form, if all coefficients above a leading coefficient are 
zeros. 

Definition 6 For any matrix A G qJ' rank r, there is a PLE decomposition 

A = PLE where P is a permutation matrix, L is a m x r lower triangular matrix 
and E is a r X n matrix in row-echelon form, with unit leading coefficients. 

Algorithm 2.5 shows how to compute such a decomposition by a block recursive 
algorithm, thus reducing the complexity to that of matrix multiplication. 



Algorithm 2.5 [PLE decomposition] 



m X n 
9 



Input: Ae¥] 

Output: (P, L,E) a PLE decomposition of A 

1 

2 
3 



4: 
5: 
6: 
7: 
8: 
9: 
10: 

11: 



if n = 1 then 

if A = Omxi then return {Im, Iq, A); end if 

Let j be the column index of the first non zero entry of A and P = Tij the 
transposition between indices 1 and j; 
return {P,PA,[1]); 



else 



{Pi,Li,Ei) = PLE(Ai); {recursively} 
A2 = P1A2; 

^3 = ^rj^a; {using trsm} 

Ai = Ai - Li^2A3] {using gemm} 

iP2,L2,E2) = PLE(A4); {recursively} 



return (Pi 



Split A columnwise in halves: 

A = [Ai A2] 



12: end if 







' Lis 




'El A3 


^2. 


1 


^2^1,2 L2 


7 





Spht A2 = 

where A3 
rows. 

); 



A3 

Ai 



Li 



L12 



and Li i have ri 
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2.3 Echelon forms 



The row-echelon and reduced row-echelon forms can be obtained from the PLE 
decomposition, using additional operations: trsm, trtri and trtrm, as shown in 
algorithm 2.6 and 2.7. 



Algorithm 2.6 [RowEchelon] 



Input: Ae¥] 

Output: {X, E) such that XA = E, X is non-singular and E is in row-echelon form 



mxji 

q 



{P,L,E)^^PLE{A); 

Xi = Lj"^; {using trtri} 

X2 = —L2X1; {using trnim} 



Split L 



L2 



Li : r X r. 



4: return X 



Xo I„ 



P'\E 



Algorithm 2.7 [ReducedRowEcheloii] 



Input: A e F™^" 

Output: (Y, R) such that YA = R, Y is non-singular and R is in reduced row- 
echelon form 
1: (X, = RowEchelon(yl); 

2: Let Q be the permutation matrix that brings the leading row coefficients of E 

to the diagonal; 
3: Set EQ = [Ui U2] ; {where Ui is r x r upper triangular} 
4: Yi = Ui^; {using trtri} 
5: Yi = YiXi; {using trtrm} 
6: i? = [ir Q^; {using trsm} 



7: return Y = 



U2 



P'R 



Figure 1 shows the various steps between the classical Gaussian elimination (LU 
decomposition), the computation of the echelon form and of the reduced echelon 
form, together with the various problems that each of them solve. Table 2 shows 
the leading constant in the asymptotic time complexity of these algorithms, 
assuming that two n x n matrices can be multiplied in C^v!^ -\- o(n'^). 

Remark 7 Note that, if the rank r is very small compared to the dimensions m x n 
of the matrix, a system Ax — b can be solved in time bounded by O ((m + n)r^) [45, 
Theorem 1]. 
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RLE 


A 




2/3 



- Det 




- Rank 




- RankProfile 


- Solve 




\ E 


TRTRI 


l\ 


1/3 


PLE decomp 


A=PLE 





Echelon 
Nullspace basis 



TRTRI 




1/3 





Echelon form decomp 
(TA=E) 



TRTRM 



2/3 



- Reduced 
Echelon form 

- Inverse 



Reduced Echelon 
form decomp 
TA=R 



Figure 1: Reductions from PLE decomposition to Reduced echelon form 



Algorithm 

gemm 

trsm 

trtri 

trtrm, PLE 

Echelon 

RedEchelon 



Constant K,,, 







2 




2 


1 






1 ^ 




-2)(2-^i-l) 


3 ~ 

2 ^ 

3 ~ 
1 

2 




2 2'^-2 




1 2'*'-2 
(2"-i+2) 


(2—1 


-2)(2— 1-1) 



0.33 
0.66 



6 
4 

= 1.6 
2.8 

— sa 4 4 
5 

f =8.8 



5 

14 

5 

22 



Table 2: Complexity of elimination algorithms 



3 Minimal and characteristic polynomial of a dense 
matrix 

Definition 8 1. A Las- Vegas algorithm is a randomized algorithm which is al- 
ways correct. Its expected running is time is always finite. 

2. A Monte-Carlo algorithm is a randomized algorithm which is correct with a 
certain probability. Its running time is deterministic. 

The computation of the minimal and characteristic polynomials is closely related 
to that of the Frobenius normal form. 

Definition 9 Any matrix A G F^^" is similar to a unique block diagonal matrix 
F = P^^AP = diag{Cf-^, . . . ,Cf^) where the blocks Cf. are companion matrices of 
the polynomials fi, which satisfy fi+i\fi. The fi are the invariant factors of A and 
F is the Frobenius normal form of A. 

Most algorithms computing the minimal and characteristic polynomial or the Frobe- 
nius normal form rely on Krylov basis computations. 

Definition 10 1. The Krylov matrix of order d for a vector v w.r.t a matrix A 
is the matrix KA,v,d = [v Av ... A'^~^v\ G F^^''. 
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2. The minimal polynomial P^l„ of A and v is the least degree monic polynomial 
P such that P{A)v = 0. 



Theorem 11 1. AKA,v,d = KA,v,dCpA.^, where d = deg(P^:^) 



2. For lineraly independent vectors {vi, . . . , Vk), if K — [KA,vi.di 



Ka. 



is non singular. Then AK = K 



C 



B2A 



Bi, 



B. 



fe,i 



B 



Bi^k 

B2.k 



k.2 



where the 



blocks Bi j are zero except on the last column. 



3. For linearly independent vectors (wi, . . . ,Vk), let (di, . . . dk) be the lexicograph- 



ically largest sequence of degrees such that K = \j^A,vi,di 
non-singular. Then 



Ka, 



R-^AK = 



Bi. 



Bi^k 

B2,k 



= H 



(1) 



Remark 12 1. Some choice of vectors ui, . . . ,Vk lead to a matrix H block diag- 
onal: this is the Frobenius normal form [27] 

2. The matrix obtained at equation (1) is called a Hessenberg form. It suffices to 
compute the characteristic polynomial from its diagonal blocks. 

Theorem 13 The Frobenius normal form can be computed: 

1. by a deterministic algorithm [49] in Gn"^ + O (n^ log'^ n) field operations, (only 
(2 + l)^^'^ + O (n^) for the characteristic polynomial [19]) 

2. by a deterministic algorithm [4-8] in O (n" log n log log 71), together with a trans- 
formation matrix, (only O (n" log 71) for the characteristic polynomial [4O] ) 

3. by a Las- Vegas algorithm [22] in O (n" log n) field operations for any field, 
together with a transformation matrix 

4. by a Las- Vegas algorithm [47] in O {n'^) for q > 2n^, without transformation 
matrix. 

The minimal and characteristic polynomials, obtained as the first invariant factor 
and the product of all invariant factors, can be computed with the same complexities. 

Remark 14 These algorithms are all based Krylov bases. Algorithm (1.) iteratively 
compute the Krylov iterates one after the other. Their cubic time complexity with 
a small leading constant makes them comparable to Gaussian elimination. A fast 
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exponentiation scheme by Keller-Gehrig [40] achieves a sub-cubic time complexity 
for the characteristic polynomial, off by a logarithmic factor of n from the matrix 
multiplication. The choice for the appropriate vectors that will generate the Frobenius 
normal form can be done either probabilistically (Las-Vegas) or deterministically with 
an log log n factor. Algorithm (4.) uses a different iteration where the size of the 
Krylov increases according to an arithmetic progression rather than geometric ( as all 
others ) and the transformation matrix is not computed. This allows it to match to the 
complexity of matrix multiplication. This reduction is practical and is implemented 
as in LinBox. 

Remark 15 These probabilistic algorithms depend on the ability to sample uni- 
formly from a large set of coefficients from the field. Over small fields, it is always 
possible to embed the problem into an extension field, in order to make the random 
sampling set sufficiently large. In the worst case, this could add a O (log(n)) factor 
to the arithmetic cost and prevent most of the bit-packing techniques. Instead, the 
effort of [22] is to handle cleanly the small finite field case. 

4 Blackbox iterative methods 

We consider now the case where the input matrix is sparse, i.e., has many zero ele- 
ments, or has a structure which enables fast matrix-vector products. Gaussian elim- 
ination would fill-in the sparse matrix or modify the interesting structure. Therefore 
one can use iterative methods instead which only use matrix- vector iterations ( black- 
box methods [38]). There are two major differences with numerical iterative routines: 
over finite fields there exists isotropic vectors and there is no notion of convergence, 
hence the iteration must proceed until exactness of the result [42]. Probabilistic 
early termination can nonetheless be applied when the degree of the minimal poly- 
nomial is smaller than the dimension of the matrix [34, 20, 23]. More generally the 
probabilistic nature of the algorithms presented in this section is subtle: e.g., the 
computation of the minimal polynomial is Monte-Carlo, but that of system solving, 
using the minimal polynomial, is Las- Vegas (by checking consistency of the produced 
solution with the system) . Making some of the Monte-Carlo solutions Las- Vegas is 
a key open-problem in this area. 

4.1 Minimal Polynomial and the Wiedemann algorithm 

The first iterative algorithm and its analysis are due to D. Wiedemann [58]. The al- 
gorithm computes the minimal polynomial in the Monte-Carlo probabilistic fashion. 

Definition 16 For a linearly recurring sequence S — (Si), its minimal polynomial 
is denoted by Us. 

• The minimal polynomial of a matrix is denoted Ha = H(A')- 

• For a matrix A and a vector b, we note IlA,b — ^lA^-b)- 
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• With another vector u, we note Tiu,A.b = '^{u^-A^-b)- 



Algorithm 4.1 [Wiedemann minimal polynomial] 
Input: A e F^^", u, & e F^. 
Output: n„ A,b- 
1: Compute 5* = {u^ A^b) for i < 2n; 

2: Use the Berlekamp-Massey algorithm to compute the minimal polynomial of the 
scalar sequence S; 



Definition 17 We extend Euler's totient function by $q.fc(/) ~ Yli^^l '^'^'), where 
di are the degrees of the distinct monic irreducible factors of the polynomial f . 

Theorem 18 For vectors ui,...,Uj selected uniformly at random, the probability 
that lcm(n„^_A,&) = n^.h is at least $,,fc(n^_f,). 

Theorem 19 For vectors bi,...,bk selected uniformly at random, the probability 
that lcm(n^ f,i) = is at least ^q.ki^A)- 

4.2 Rank, Determinant and Characteristic Polynomial 

It is possible to compute the rank, determinant, and characteristic polynomial of a 
matrix from its minimal polynomial. All these reductions require to precondition 
the matrix so that the minimal polynomial of the obtained matrix will reveal the 
information sought, while keeping a low cost for the matrix-vector product [25, 37, 
20, 52, 55, 56, 9]. 

Theorem 20 [25] Let S be a finite subset of a field F that does not include 0. Let 
A e F™^" having rank r. Let Di G S*"^" and D2 G 5""><™ be two random diagonal 
matrices then degree{minpoly{Di x A* x D2 x A x Di)) — r, with probability at least 

-1 11. n^—n 

^ 2|5| ■ 

Theorem 21 [52] Let S be a finite subset of a field F that does not include 0. Let 

U G gnxn g y^j^ upper bi-diagonal matrix where the second diagonal elements 

wi, . . . , Un-i are randomly selected in S. For A £ F"^"^ the term of degree of the 

2 

minimal polynomial of UA is the determinant of A with probability at least 1 — . 

Remark 22 // A is known to be non-singular the algorithm can be repeated with 
different matrices U until the obtained minimal polynomial is of degree n. Then it is 
the characteristic polynomial of UA and the determinant is certified. Alternatively 
if the matrix is singular then X divides the minimal polynomial. As Wiedemann's 
algorithm always returns a factor of the true minimal polynomial, and U is invertible, 
the algorithm can be repeated on UA until either the obtained polynomial is of degree 
n or it is divisible by X. Overall the determinant has a Las-Vegas blackbox solution. 
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Theorem 23 [55, 56] Let S he a finite subset of a field F that does not include 
and A e F"^" with si,.. . ,st as invariant factors. Let U S 5'"x'= and V G s''^" 
be randomly chosen rank k matrices in F. Then gcd{IlA,^A+uv) — Sfe+i with 
probability at least 1 - . 

Remark 24 Using the divisibility of the invariant factors and the fact that their 
product is of degree n, one can see that the number of degree changes between suc- 
cessive invariant factors is of order O /55/. Thus by a binary search over 
successive applications of theorem 23 one can recover all of the invariant factors and 
thus the characteristic polynomial of the matrix in a Monte- Carlo fashion. 

4.3 System solving and the Lanczos algorithm 

For the solution of a linear system Ax — b, one could compute the minimal polyno- 
mial f, and then derive a solution of the system as a linear combination of the 
A^b. The following Lanczos approach is more efficient for system solving as it avoids 
recomputing (or storing) the latter vectors [25, 28]. 



Algorithm 4.2 [Lanczos system solving] 

Input: A e F™^", b e F". 

Output: X e F" such that Ax = 6 or failure. 

1: Let A = DiA^ D2AD1 and b = DiA^ D2b+Av with Di and D2 random diagonal 
matrices and v a random vector; 

2: wq = b; vi ^ Awq- to = vfwo; 7 = 6*^0^0 ^ xq = JWq; 

3: repeat 

4: a = vf^^v^+it^^; P = vJ^^Vit[[\; w^+i = u^+i - awi - /3wi_i; 

5: V.i+2 = AWi+i] ti+i = wl^j^Vi+2; 

7: until Wi+i = or ti+i — 0; 
8: Return x — Di{xi+i — v); 



The probability of success of algorithm 4.2 follows also theorem 20. 

Remark 25 Over small fields, if the rank of the matrix is known, the diagonal 
matrices of line 1 can be replaced by sparse preconditioners with O {n\og{n)) non- 
zero coefficients to avoid the need of field extensions [9, corollary 7.3]. 

Remark 26 // the system with A and b is known to have a solution then the algo- 
rithm can be turned Las-Vegas by checking that the output x indeed satisfies Ax — b. 
In general, we do not know if this algorithm returns failure because of bad random 
choices or because the system is inconsistent. However, Giesbrecht, Lobo and Saun- 
ders have shown that when the system is inconsistent, it is possible to produce a 
certificate vector u such that u'^ A = together with u'^b ^ within the same com- 
plexity [28, Theorem 2.4]. Overall, system solving can be performed by blackbox 
algorithms in a Las- Vegas fashion. 
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5 Sparse and structured methods 



Another approach to sparse hnear system is to use Gaussian ehmination with pivot- 
ing, taking into account the zero coefhcients. This algorithm modifies the structure 
of the matrix and might suffer from fiU-in. Consequently the available memory 
is usually the bottleneck. From a triangularization one can naturally derive the 
rank, determinant, system solving and nullspace. Comparisons with the blackbox 
approaches above can be found e.g., in [20]. 

5.1 Reordering 



Algorithm 5.1 [Gaussian elimination with linear pivoting] 
Input: a matrix A e F™^"; 

Output: An upper triangular matrix U such that there exists a unitary lower- 
triangular matrix L and permutations matrices P and Q over F, with A — 
P-L-U -Q; 
1: for all elimination steps do 

2: Choose as pivot row the sparsest remaining row; 

3: In this row choose the non zero pivot with lowest number of non zero elements 

in its column; 
4: Eliminate using this pivot; 
5; end for 



Remark 27 Yannakakis showed that finding the minimal fill-in ( or equivalently the 
best pivots) during Gaussian elimination is an NP- complete task [60]. In numerical 
algorithms, heuristics have been developed and comprise minimal degree ordering, 
cost functions or nested dissection (see e.g., [61, 1, 31]). These heuristics for reduc- 
ing fill-in in the numerical setting, often assume symmetric and invertible matrices, 
and do not take into account that new zeros may be produced by elimination oper- 
ations (oij — Oij -\- Si * o,kj), o,s is the case with matrices over finite fields. [20] 
thus proposed the heuristic 5.1 to take those new zeros into account, using a local 
optimization of a cost function at each elimination step. 

5.2 Structured matrices and displacement rank 

Originating from the seminal paper [33] most of the algorithms dealing with struc- 
tured matrices use the displacement rank approach [46]. 

Definition 28 For A £ F™^" and B e F"^", the Sylvester (resp. Stein) linear 
displacement operator Va,b (resp. ^a,b) satisfy for M G f™^"; 

V^^b(M) ^ AM-MB 
AA,BiM) = M -AMB 

A pair of matrices {Y, Z) £ F™^" x F"^" is a A, i?-Sylvester-generator of length a 
(resp. Stem) for M if\/A,B{M) = Y (resp. Aa,s(M) = YZ'^). 
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The main idea behind algorithms for structured matrices is to use such generators 
as a compact data structure, in cases where the displacement has low rank. 

Usual choices of matrices A and B are diagonal matrices and cyclic down shift 
matrices: 

Definition 29 D^,a; G F" is the diagonal matrix whose (i,i) entry is Xi. 

^n,¥3j</' S F is the n X n unit circulant matrix having Lp at position ones in 

the suhdiagonal {i + l,i) and zeros elsewhere. 



operator matrices 


class of structured 


rank of 


number of flops 




A 


B 


matrices M 




for computing M 


V 






Toeplitz and its inverse 


< 2 


0{{m- 


f n) log(m + 


n)) 






Hankel and its inverse 


< 2 


0{{jn- 


f n) log(TO + 


n)) 






Toeplitz+Hankel 


< 4 


0{{m- 


f n) log(r7i + 


n)) 






Vandermonde 


< 1 




- n) log^ (to 4 


-n)) 






inverse of Vandermonde 


< 1 




- n) log^ (to 4 


-n)) 


^n.O 




transposed of Vandermonde 


< 1 




- n) log^ (to 4 


-n)) 






Cauchy and its inverse 


< 1 


0{{m^ 


- n) log (to 4 


-n)) 



Table 3: Complexity of the matrix-vector product for some structured matrices 



As computing matrix vector products with such structured matrices have close 
algorithmic correlation to computations with polynomials and rational functions, 
these matrices can be multiplied by vectors fast, in nearly linear time as shown on 
table 3. Therefore the algorithms of section 4 can naturally be applied to structured 
matrices, to yield almost O (n^) time linear algebra. 

Now, if the displacement rank is small there exists algorithms quasi linear in n, 
the dimension of the matrices, which over finite fields are essentially variations or 
extensions of the Morf/Bitmead- Anderson divide-and-conquer [44, 4] or Cardinal's 
[8] approaches. The method is based on dividing the original problem repeatedly 
into two subproblems with one leading principal submatrix and the related Schur 
complement. This leads to O (^a'^n-^^°'-^'>) system solvers, which complexity bound 
have recently been reduced to O [6, 32]. We few exceptions, all al- 

gorithms thus need matrices in generic rank profile. Over finite fields this can be 
achieved using Kaltofen and Saunders unit upper triangular Toeplitz preconditioners 
[37] and by controlling the displacement rank growth and non-singularity issues [35] . 

6 Hybrid methods 

6.1 Hybrid sparse-dense methods 

Overall, as long as the matrix fits into memory, Gaussian elimination methods are 
usually faster than iterative methods, over finite fields [20]. There are then heuristics 
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trying to take the best of both strategies. Among those we briefly mention the most 
widely used: 

• Perform the Gaussian elimination with reordering 5.1 until the matrix is almost 
filled-up. If the remaining non-eliminated part would fit as a dense matrix, 
switch to the dense methods of section 2. 

• Maintain two sets of rows (or columns), sparse and dense. Favor elimination 
on the sparse set. This is particularly adapted to index calculus [41]. 

• Perform a preliminary reordering in order to cut the matrix into four quadrants, 
the upper left one being triangular. This, together with the above strategies has 
proven effective on matrices which are already quasi-triangular, e.g., Grobner 
bases computations in finite fields [26] . 

• If the rank is very small compared to the dimension of the matrix, one can use 
left and right highly rectangular projections to manipulate smaller structures 
[43]. 

• The arithmetic cost and thus timing predictions are easier on iterative meth- 
ods than on elimination methods. On the other hand the number of non-zero 
elements at a given point of the elimination is usually increasing during an 
elimination, thus providing a lower bound on the remaining time to trian- 
gularize. Thus a heuristic is to perform one matrix-vector product with the 
original matrix and then eliminate using Gaussian elimination. If at one point 
the lower bound for elimination time surpasses to predicted iterative one or if 
the the algorithm runs out of memory, stop the elimination and switch to the 
iterative methods [21]. 

6.2 Block- iterative methods 

Iterative methods based on one-dimensional projections, such as Wiedmann and 
Lanczos algorithm can be generalized with block projections. Via efficient precon- 
ditioning [9] these extensions to the scalar iterative methods can present enhanced 
properties: 

• Usage of dense sub-blocks, after multiplications of blocks of vectors with the 
sparse matrix or the blackboxes, allows for a better locality and optimization 
of memory accesses, via the application of the methods of section 1. 

• Applying the matrix to several vectors simultaneously introduces more paral- 
lelism [10, 11, 34]. 

• Also, their probability of success augments with the size of the considered 
blocks, especially over small fields [36, 54]. 

Definition 30 Let X e Fj^", Y e F^^'= and H, = XA^Y for i = 0...n/k. The 

matrix minimal polynomial of the sequence Hi is the matrix polynomial Fx^a,y € 
Fg[X]'^^'^ of least degree, with its leading degree matrix column-reduced, that annihi- 
lates the sequence {Hi). 
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Theorem 31 The degree d matrix minimal polynomial of a block sequence (Hi) S 
^pk-Kky^ (,an be computed in O {k^d^^ using block versions of Hermite-Pade approx- 
imation and extended Euclidean algorithm [3] or Berlkamp-Massey algorithm [11, 
36, 54]- Further improvement by [3, 51, 29, 39] bring this complexity down to 
O {k^ d)^^°^^\ using a matrix extended Euclidean algorithm. 



Algorithm 6.1 [Nullspace vector] 
Input: A e F^^" 

Output: w e a vector in the nullspace of A 
Pick XGF^^",reF^^'= uniformly at random; 
Compute the sequence Hi = XA^Y; 
Compute Fx,A,Y the matrix minimal polynomial; 
Let / — frX^ + • • • + fdx'^ be a column of Fx,a.y', 
Return uj = Yfr + AY fr+i + ■■■+ A'^-'Yfd' ^ 



Remark 32 These block-Krylov techniques are used to achieve the best known time 
complexities for several computations with black-box matrices over a finite field or 
the ring of integers: computing the determinant, the characteristic polynomial [39] 
and the solution of a linear system of equations [24]- 
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